The Internet is rapidly becoming a key method for communication and the dissemination
of documents and ideas. There are many reasons why this is so:


* Email combines the virtually instant transmission of information found in electronic
  communication means such as the telephone or television with the asynchronous delivery
  found in a written letter. It is easy to scale the delivery from a single recipient to hundreds
  or even more through mailing lists. This combination is very attractive, and for many
  users has already displaced both the telephone and paper based letters.


* Web pages provide a very scalable method of publishing information to a wide audience.
  A web page can economically reach a very small audience, because it is inexpensive to
  create. Yet, should the site attract a large audience, it is easy and cheap to scale the
  delivery up to that point.


* Bulletin boards and news groups provide forums for discussion and interaction between
  multiple participants. Prior electronic communications media, such as the telephone,
  excelled at one-to-one communication; radio and television are best at one-to-many
  communication (where many is typically measured in millions). None of these prior
  electronic forms could efficiently handle interaction between small ad hoc groups of
  people - such as those found in conferences, meetings, or less formal gatherings.


* Chat services provide direct, synchronous communication that includes a complete
  log.


* Indexing and cataloging services allow easy access to information across the entire net.


* All of these aspects of the Internet are remarkably cheap, both in absolute terms and in
  comparison with other media.


These properties make the Internet a tremendous information resource. Technological
trends suggest that the Internet will get a variety of new capabilities over time, such as the ability
to easily deal with high quality video. The Internet is about all you could ask of an information
resource.


Except one thing: the Internet is not naturally archival. 


Paper based publishing or information storage creates, as a necessary by-product, an
archival record that can last for hundreds of years if stored properly. In addition to the physical
properties of the media, there is cultural support for archiving the material. A large
infrastructure - the library system - has been set up to maintain an archival record of most paper
based publishing (books, magazines and newspapers), and make it accessible to nearly any citizen.
Indeed, there are legal requirements to archive a copy of most kinds of printed material in a library
(such as the Library of Congress or British Museum Library) in order to fully enjoy the benefits of
copyright protection. Many private documents that are not published are nevertheless stored
as a matter of course. Examples include business records, government records, private
correspondence, etc. Some of these have legal requirements; others are simply stored out of
tradition.


In contrast to this situation, Internet based information creates no archival record
as a natural by-product. Virtually all web sites and bulletin boards reside on hard disks - which
are read-write storage devices. These are routinely overwritten and reused. In addition,
technological improvement leads to the rapid obsolescence of the computers and hard disks on which
data on the net is stored. Older material is not always moved over to new machines when this
occurs.


It is technologically feasible to create archival forms of the data on the Internet, but only if 
you deliberately set out to do so. There is no cultural or societal imperative to create archival 
material in the Internet community, whether in libraries or privately. In fact, just the opposite 
attitude exists. The rapid pace of technological change has created a cultural expectation of rapid 
update to the "latest and greatest" version of any sort of information. 


As a result, the early days of the Internet may well appear to future historians as a
pre-literate society, for none of it will have remained for them to study. This lack of material will
impede both the study of the Internet itself and the study of any societal or cultural
trends that rely on the Internet. Today this is a substantial list, but within the next several
decades, it will encompass nearly any aspect of our lives. The paper based historical record will
be increasingly insufficient, because it will capture a declining portion of the overall
historical record.


The alternative is to archive the Internet. We can create archival copies of every publicly
accessible site on the net. Imagine snapshots of the net stored as a sort of time capsule. A 
future historian could browse or search the entire Internet, as of 10AM, February 11, 1997. 
Every Usenet newsgroup and every web site would be present. 


Such an archive might seem prohibitively expensive - equal in cost to the entire net - but
using digital tape it can be captured very cheaply. In that form it would be difficult to access, but
at least it would be stored for the future. As computer storage technology continues to
improve, it will become increasingly easy to put the snapshot on line. Within 10 years, the
snapshot of the net as of February 11, 1997 will be quite small compared to the Internet circa
2007 - probably just a few percent of it.


Future historians will use advanced information retrieval software to explore, browse and
correlate information in such a snapshot. They will be able to see the Internet as somebody in
1997 would or, if they choose other views, perform all sorts of analytical studies
treating the entire snapshot as a corpus. Want to find out who started a particular idea, rumor or
trend? Search for the first occurrence, then find all related instances. How prevalent was
discussion of the presidential election on the net in 2000 versus 1996? Run cross
comparisons by searching and cataloging sites from the snapshots of both years. Traditional historical
analysis will be possible, but so will many new methodologies that are enabled by the
information retrieval software.
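

As a rough illustration of the two queries just described, the following Python sketch treats
each snapshot as a list of dated documents. The Document record, the sample data and the function
names are all hypothetical, and plain substring matching stands in for the far more capable
retrieval software anticipated here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Document:
    """One archived page or posting (hypothetical record layout)."""
    url: str
    captured: date
    text: str


def first_occurrence(docs: Iterable[Document], phrase: str) -> Optional[Document]:
    """Return the earliest-captured document that mentions the phrase, if any."""
    needle = phrase.lower()
    matches = [d for d in docs if needle in d.text.lower()]
    return min(matches, key=lambda d: d.captured, default=None)


def prevalence(docs: Iterable[Document], phrase: str) -> float:
    """Fraction of the documents in a snapshot that mention the phrase."""
    docs = list(docs)
    if not docs:
        return 0.0
    needle = phrase.lower()
    return sum(needle in d.text.lower() for d in docs) / len(docs)


# Two tiny made-up snapshots, for illustration only.
snapshot_1996 = [
    Document("http://example.org/a", date(1996, 10, 1), "Notes on the presidential election"),
    Document("http://example.org/b", date(1996, 10, 3), "A recipe for bread"),
    Document("http://example.org/c", date(1996, 9, 12), "Election rumor: first sighting"),
    Document("http://example.org/d", date(1996, 10, 5), "Gardening tips"),
]
snapshot_2000 = [
    Document("http://example.org/e", date(2000, 10, 2), "Election coverage"),
    Document("http://example.org/f", date(2000, 10, 4), "More election coverage"),
]

earliest = first_occurrence(snapshot_1996, "election")
share_1996 = prevalence(snapshot_1996, "election")
share_2000 = prevalence(snapshot_2000, "election")
```

A real archive would need an index rather than a linear scan, but the shape of the questions
stays the same: order matches by capture date to find an origin, and compare match rates
across snapshots to measure prevalence.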


A series of such snapshots will in the long run be an invaluable record for future historians.
Without deliberate archiving, the Internet information will be lost, and the historical record of the
society based on it will therefore be fragmented and incomplete. However, once you do
archive the net, the situation reverses because you get a historical record that is larger and
more detailed than any that has been available.


Fields other than history would find the archive useful as well. Linguists and 
lexicographers would find the corpus invaluable as a study of the evolution of language. 
Economists will be able to track the proliferation of web based commerce. Social scientists will 
get a vast store of information on popular culture. People in all walks of life use public libraries 
to access information in old newspapers, magazines and books - whether for school projects,
nostalgia or whatever reason they desire. If people have a desire to access the past today, why
won't they have similar desires in the future?


Digital tape and other off line media seem to be the cheapest way to store such an archive
until a combination of funding and technology improvement makes it feasible to place the
archive online. Hopefully this can happen as soon as possible, but given the long term nature
of the project, priority must be given to capturing the data and saving it for posterity.


The archiving can be done periodically, as a series of snapshots taken at a specific time (or
over some small interval). An improvement on this scheme would be to supplement the periodic
snapshots of the entire net with incremental changes taken over an even more frequent period.
Existing web crawlers used by search and indexing sites are capable of capturing this information
today.
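

The combination of periodic full snapshots and more frequent incremental changes can be
sketched in Python as follows. This is only a minimal illustration under stated assumptions:
a snapshot is modeled as a plain mapping from URL to page content, and the names digest, diff
and restore are hypothetical rather than part of any existing crawler.

```python
import hashlib
from typing import Dict, List, Optional

# A snapshot maps each URL to the page content captured for it.
Snapshot = Dict[str, str]
# An increment maps each changed URL to its new content, or None if the page vanished.
Increment = Dict[str, Optional[str]]


def digest(content: str) -> str:
    """Content fingerprint used to decide whether a page has changed."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff(previous: Snapshot, current: Snapshot) -> Increment:
    """Record only the pages added, changed or removed since `previous`."""
    changes: Increment = {}
    for url, content in current.items():
        if url not in previous or digest(previous[url]) != digest(content):
            changes[url] = content
    for url in previous:
        if url not in current:
            changes[url] = None
    return changes


def restore(base: Snapshot, increments: List[Increment]) -> Snapshot:
    """Rebuild a later state by replaying increments over a full snapshot."""
    state = dict(base)
    for inc in increments:
        for url, content in inc.items():
            if content is None:
                state.pop(url, None)
            else:
                state[url] = content
    return state


# Made-up crawl results, for illustration only.
full = {"a": "v1", "b": "v1"}
day1 = {"a": "v2", "b": "v1", "c": "new"}
day2 = {"a": "v2", "c": "new"}

inc1 = diff(full, day1)
inc2 = diff(day1, day2)
rebuilt = restore(full, [inc1, inc2])
```

Comparing fingerprints rather than full text means a crawler could keep only the digests
from its previous pass and still detect which pages need to be stored again.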


In the future, it is likely that new standards will be created to improve the efficiency of 
indexing, making the web crawler’s task easier. As that occurs, the needs of archiving should be 
incorporated. Change notification standards, which automatically note changes in sites, may make 
this task even easier in the future. In fact, standards might one day exist which are directly 
designed to make archiving easier. However, there is no point in waiting until then - a 
very serviceable archive can be captured today. 


[...] most of which I find rather dubious. The most common suggestion is to edit the record and
only [...] The cliche of separating the "wheat" from the "chaff" is frequently invoked.


The lesson from archeology is quite clear - people are rather poor at making this
determination. Archeologists rely more on trash dumps, refuse pits and toilet areas than they do
on the "official" record. The things which people explicitly discarded as unimportant at the time
are often the most useful in reconstructing day to day life, or in learning what the typical person
was doing. The elite of any society is interested in recording its actions - through burials,
temples, monuments and written records. Society as a whole tends to be underrepresented,
because it lacks the resources to store everything.


It is hard to second-guess what technology, or what interests, people will have in the
future. The study of AIDS was greatly helped by stored samples of blood. Such a sample
allowed the first recorded case to be traced to a British sailor who died of a mysterious illness in
1952. With more samples, particularly in Africa, we would almost certainly know a great deal
more about the disease and its origin. This case is not atypical - DNA analysis and a host of other
technologies are helping piece together mysteries such as the social structure of Easter Island,
"Otzi" the prehistoric man found frozen in the Alps, or the mummies of Incan children from the
Andes. They certainly had no clue at the time what technology we would be using now. In fact,
had the remains been discovered a decade or two ago, the scientists involved in the discovery
might not have guessed that PCR, magnetic resonance imaging, CAT scans and computer analysis
would become routine tools.


Future information retrieval software, which is able to parse and understand human 
languages, could revolutionize the study of a large corpus of text, but we will not be able to apply 
such techniques to the historical record unless it is saved. 


The interests of historians are equally hard to guess up front. It is easy to say that 
historians would be interested in the Internet postings done by Presidents of the United States, 
Nobel Prize winners, diplomats, writers and the leading intellectuals of our time. Future 
occupants of these lofty positions are on the net today - as children or undiscovered 
adults - making web sites, posting to newsgroups and creating a record that will only be 
interesting in retrospect. The same is true of political movements and organizations, businesses 
and social trends. We cannot know what to save until we know what is important - not important 
now, but in the future. 


The net is a very organic and connected community. A new trend, idea or bit of urban
folklore can start one place, then rapidly spread. Although any one user may visit a very
restricted set of sites, the users as a community have a great deal of overlap. Search tools can -
by design or accidentally - expose anybody to anything. There is a short chain of connection
between any site and any user. Any division of the net into "historically important" or "wheat"
sites versus the unnecessary chaff is artificial.


I believe that it is incredibly dangerous to second-guess future generations, and edit the
historical record. We should archive all of the net that we possibly can. Ironically, it is probably
cheaper and easier to store it all. Digital tape is cheap. Human time to categorize and edit is
expensive by comparison. Leave the editing and selection for future generations - or their
software agents.


